The word entropy of natural languages

Authors

  • Christian Bentz
  • Dimitrios Alikaniotis
Abstract

The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. Entropy has been established as a measure of this average uncertainty, also called average information content. We here use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These convergence points are then used to select texts from a massively parallel corpus, and to estimate word entropies across more than 1000 languages. Our results help to establish quantitative language comparisons, to understand the performance of multilingual translation systems, and to normalize semantic similarity measures.
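
For readers who want to experiment with the idea, the following is a minimal sketch of a plug-in (maximum-likelihood) word entropy estimate and of tracking how it stabilises as the token count grows. The function names (word_entropy, entropy_trajectory) are illustrative only, and the paper's own estimators may be bias-corrected rather than this simple plug-in variant.

import math
from collections import Counter

def word_entropy(tokens):
    # Plug-in (maximum-likelihood) estimate of word entropy in bits:
    # H = -sum_w p(w) * log2(p(w)), with p(w) the relative frequency
    # of word type w among the observed tokens.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_trajectory(tokens, step=1000):
    # Entropy estimates on growing prefixes of the token list; the point
    # at which the values level off is the kind of convergence point the
    # abstract refers to.
    return [(n, word_entropy(tokens[:n]))
            for n in range(step, len(tokens) + 1, step)]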

Similar resources

The Entropy of Words - Learnability and Expressivity across More than 1000 Languages

The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics, and the language sciences more generally. Information theory gives us the tools to measure precisely the average amount of choice associated with words – the word entropy. Here we use three parallel corpora – encompassing ca. 450 million w...

Complexity measurement of natural and artificial languages

We compared entropy for texts written in natural languages (English, Spanish) and artificial languages (computer software) based on a simple expression for the entropy as a function of message length and specific word diversity. Code text written in artificial languages showed higher entropy than text of similar length expressed in natural languages. Spanish texts exhibit more symbolic diversit...

Word-Forming Process in Azeri Turkish Language

This study examines the general methods of natural word formation in the Azeri Turkish language by analyzing the construction of compound Azeri Turkish words. Same’ei (2016) carried out a comprehensive study of word-forming processes in Farsi, which served as the inspiration for this study of Azeri Turkish word formation. Numerous scholars had done vari...

Discovery of Kolmogorov Scaling in the Natural Language

Abstract: We consider the rate R and variance σ² of Shannon information in snippets of text based on word frequencies in the natural language. We empirically identify Kolmogorov’s scaling law in σ² ∝ k^(−1.66 ± 0.12) (95% c.l.) as a function of k = 1/N measured by word count N. This result highlights a potential association of information flow in snippets, analogous to energy cascade in turbulent ed...
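
As an illustration only, a power-law exponent of the kind reported above can be recovered by a straight-line fit on log-log axes. The data below are synthetic and stand in for the paper's actual measurements and fitting procedure.

import numpy as np

# Synthetic (k, variance) pairs generated to follow the reported exponent
# of about -1.66; purely for demonstrating the log-log fit.
k = np.array([0.001, 0.002, 0.005, 0.01, 0.02, 0.05])
var = 2.0 * k ** -1.66

slope, _ = np.polyfit(np.log(k), np.log(var), 1)
print(f"estimated exponent: {slope:.2f}")  # about -1.66 for this synthetic data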

Universal Entropy of Word Ordering Across Linguistic Families

BACKGROUND The language faculty is probably the most distinctive feature of our species, and endows us with a unique ability to exchange highly structured information. In written language, information is encoded by the concatenation of basic symbols under grammatical and semantic constraints. As is also the case in other natural information carriers, the resulting symbolic sequences show a deli...

Journal:
  • CoRR

Volume: abs/1606.06996  Issue: -

Pages: -

Publication date: 2016